[opt] Enable CFG optimization for local tensors #3237
Conversation
/format
Nice catch! thanks!
Nice!
LGTM!
(Just curious, is it possible that there is a local tensor without an initial value?)
Currently it is impossible. But maybe we will support that in the future, and the CFG part will be updated accordingly.
Related issues: #2590, #2637, #3218, #3228
Although #2637 introduced local tensors, the related CFG optimization was not implemented, resulting in redundant local memory allocations, loads, and stores in many cases. This PR enables CFG optimization for local tensors and eliminates the overhead reported in #3218 and #3228. Let's look at a tiny example:
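For concreteness, a minimal sketch of such a kernel (the field and its shape here are assumptions for illustration, not necessarily the original example; the ti.Vector only goes through the local-tensor path when it is lowered as a single TensorType alloca rather than unrolled into scalars):

```python
import taichi as ti

ti.init(arch=ti.cpu)

x = ti.field(ti.i32, shape=8)  # illustrative field, not from the original PR

@ti.kernel
def fill():
    for i in x:
        v = ti.Vector([1, 2, 3])  # local tensor initialized with values
        # Each element access below is a load through a pointer offset into v;
        # with CFG optimization, the initializing stores can be forwarded into
        # these loads and the local allocation itself can then be eliminated.
        x[i] = v[0] + v[1] + v[2]

fill()
```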
Before this PR, the final IR for the kernel is:
After this PR, the final IR for this kernel is:
Details:
- Support alias analysis for PtrOffsetStmt, which is able to produce definitely same/definitely different results (see the sketch after this list).
- Treat a TensorType alloca as a store to enable store-to-load forwarding. This is because currently local tensors must be initialized with values (ti.Vector([1, 2, 3])), so we don't treat the TensorType alloca itself as a valid forwarding source.
- Don't propagate a PtrOffsetStmt with an alloca origin to other offloaded tasks (it shouldn't appear in the final node's live_in), to enable dead store elimination.
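To make the first point concrete, here is a rough Python sketch of the definitely-same/definitely-different query for two pointer-offset accesses; the names (PtrOffset, AliasResult, alias) are placeholders of mine, not the actual C++ implementation in Taichi:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class AliasResult(Enum):
    SAME = 1        # definitely the same address
    DIFFERENT = 2   # definitely different addresses
    UNCERTAIN = 3   # cannot decide at compile time

@dataclass(frozen=True)
class PtrOffset:
    """Stand-in for a pointer-offset access: a base alloca plus an element offset."""
    origin: str            # identity of the underlying alloca
    offset: Optional[int]  # constant offset, or None if only known at runtime

def alias(a: PtrOffset, b: PtrOffset) -> AliasResult:
    if a.origin != b.origin:
        # Offsets into different allocas can never overlap.
        return AliasResult.DIFFERENT
    if a.offset is not None and b.offset is not None:
        # Same alloca with constant offsets: compare the offsets directly.
        return AliasResult.SAME if a.offset == b.offset else AliasResult.DIFFERENT
    # A runtime-dependent offset gives no definite answer.
    return AliasResult.UNCERTAIN

# v[0] vs v[0]: definitely same; v[0] vs v[1]: definitely different.
assert alias(PtrOffset("v", 0), PtrOffset("v", 0)) is AliasResult.SAME
assert alias(PtrOffset("v", 0), PtrOffset("v", 1)) is AliasResult.DIFFERENT
assert alias(PtrOffset("v", 0), PtrOffset("v", None)) is AliasResult.UNCERTAIN
```

Only the definite answers let the pass forward a store into a later load or delete a dead store; an uncertain result has to be treated conservatively.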